Ngo et al. (2019) investigated the ontogeny of holistic episodic memory recollection in children and younger adults. The authors sought to clarify the age dependency of the memory mechanisms that allow individuals to holistically retrieve all elements of a previously encoded visual memory when presented with partial cues. This mechanism, generally referred to as pattern completion, is a neural computation in which the hippocampus reactivates a complete episodic memory representation from only a partial cue (Marr, 1971; McClelland et al., 1995; Norman & O’Reilly, 2003). The primary dependent variable in the original study was retrieval dependency (i.e., the extent to which retrieval of elements within an event is mutually contingent: all accurate or all inaccurate) for previously encoded multielement event stimuli, each containing a scene, a person, and an object. Following the encoding phase, participants completed a self-paced four-alternative forced-choice cued recognition test. The retrieval dependency measure served as the index of pattern completion. The authors observed retrieval dependency in all three age groups (4-year-olds, 6-year-olds, and younger adults) and found that retrieval dependency was dissociated from overall retrieval accuracy.
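To make the retrieval dependency measure concrete, here is a minimal base-R sketch using simulated, purely illustrative responses (the vectors `ab` and `ac` are hypothetical accuracy data, not from the study), contrasting the observed proportion of joint retrieval with the independent-model prediction:

```r
# Hypothetical data: one participant's accuracy (1 = correct, 0 = incorrect)
# for cue-A-retrieve-B and cue-A-retrieve-C trials across 24 events
set.seed(1)
ab <- rbinom(24, 1, 0.75)
ac <- rbinom(24, 1, 0.75)

# observed proportion of joint retrieval: both correct or both incorrect
observed_joint <- mean(ab == ac)

# independent-model prediction: P(AB)P(AC) + (1 - P(AB))(1 - P(AC))
p_ab <- mean(ab)
p_ac <- mean(ac)
independent_joint <- p_ab * p_ac + (1 - p_ab) * (1 - p_ac)

# retrieval dependency: values reliably above zero indicate holistic retrieval
dependency <- observed_joint - independent_joint
```

Because `ab` and `ac` are generated independently here, `dependency` should hover near zero; pattern completion predicts a positive value in the empirical data.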
In the present replication, I aim to replicate the younger-adult finding in a more diverse sample of younger adults recruited via the Prolific online platform (Prolific.co). This replication is important for my research program because I am ultimately aiming to build biologically plausible computational models of selective episodic memory retrieval, which rely heavily on the neural computation of pattern completion in the hippocampus. The same stimuli and procedure as in the original study will be used here. An example of the stimuli is presented below in Figure 1.
Fig. 1: Example stimuli from Ngo et al. (2019).
Project repository (on GitHub): https://github.com/psych251/ngo2019
Original paper (as hosted in your repo): https://github.com/psych251/ngo2019/blob/main/original_paper/ngoetal2019.pdf
Experimental paradigm: https://psych.shawntylerschwartz.com/ngo
library(pwr)
## effect sizes
d_original <- 1.32
d_large <- .80
## power levels to test
powers <- c(.80, .90, .95)
## power analyses for t-tests with effect size reported in original paper
power_80_original <- pwr.t.test(d = d_original, power = powers[1], type = "one.sample", alternative = "greater")
power_90_original <- pwr.t.test(d = d_original, power = powers[2], type = "one.sample", alternative = "greater")
power_95_original <- pwr.t.test(d = d_original, power = powers[3], type = "one.sample", alternative = "greater")
## power analyses for t-tests with relatively large effect size (to be conservative)
power_80_large <- pwr.t.test(d = d_large, power = powers[1], type = "one.sample", alternative = "greater")
power_90_large <- pwr.t.test(d = d_large, power = powers[2], type = "one.sample", alternative = "greater")
power_95_large <- pwr.t.test(d = d_large, power = powers[3], type = "one.sample", alternative = "greater")
## function to automate printing of power analysis results
print_power_results <- function(results) {
print(paste(round(results$n), "participants will be needed to achieve", results$power, "power for an effect size of d =", results$d))
}
print_power_results(power_80_original)
## [1] "5 participants will be needed to achieve 0.8 power for an effect size of d = 1.32"
print_power_results(power_90_original)
## [1] "7 participants will be needed to achieve 0.9 power for an effect size of d = 1.32"
print_power_results(power_95_original)
## [1] "8 participants will be needed to achieve 0.95 power for an effect size of d = 1.32"
print_power_results(power_80_large)
## [1] "11 participants will be needed to achieve 0.8 power for an effect size of d = 0.8"
print_power_results(power_90_large)
## [1] "15 participants will be needed to achieve 0.9 power for an effect size of d = 0.8"
print_power_results(power_95_large)
## [1] "18 participants will be needed to achieve 0.95 power for an effect size of d = 0.8"
Based on the power analyses for a large effect size (Cohen’s d = .80), I will collect data from 12 participants; data collection will stop once complete data from all 12 participants have been collected via Prolific (Prolific.co). The experiment should take about 15-20 minutes to complete.
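As a cross-check, base R’s stats::power.t.test should agree closely with pwr (both use the noncentral t distribution). Note also that a fractional n from a power analysis should be rounded up with ceiling(), not rounded to the nearest integer, when setting a recruitment target:

```r
# Cross-check of the conservative power analysis with base R; ceiling(),
# not round(), is the safe way to turn a fractional n into a sample size
chk <- power.t.test(delta = 0.80, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "one.sample", alternative = "one.sided")
ceiling(chk$n)
```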
The materials from the original article were used precisely as described; they are quoted directly from the original authors below (Ngo et al., 2019):
‘We sampled 24 cartoon images of distinct scenes (12 indoor scenes, e.g., an aquarium; 12 outdoor scenes, e.g., a playground), 24 cartoon images of common objects (e.g., a watch), and 24 images of cartoon characters from nonoverlapping movies or books (12 males, e.g., Pinocchio; 12 females, e.g., Alice) from the Google Images search engine. From this pool of selected images, we then constructed 24 “events,” each consisting of a scene (e.g., an aquarium), a person (e.g., Alice), and an object (e.g., a wallet). The event assignment of the elements was randomized, with the exception that items with preexperimental associations (e.g., books and library) were not assigned to the same event. Every possible cue–test combination of each event was tested, resulting in six test trials per event (1 = cue: scene, test: person; 2 = cue: scene, test: object; 3 = cue: person, test: scene; 4 = cue: person, test: object; 5 = cue: object, test: scene; 6 = cue: object, test: person) and totaling 144 test trials.’
The procedure in the original article was designed for children, with modifications for a young adult version; here, I followed the original young adult procedure precisely. The original article first describes the children’s version of the task and then lists the key changes for the young adult version afterward; I take the same approach by quoting the entire procedure section directly from the original authors (Ngo et al., 2019). Additionally, the original authors indicated that all participants were tested on a 13-in. laptop screen; given the remote nature of this study, I cannot control the screen size each young adult participant uses. I will therefore require participants to complete this experiment on a laptop or desktop computer (i.e., no phone or tablet) to ensure maximum compatibility and relatively standard screen sizes.
‘All participants were tested individually. The task procedure administered to children consisted of two encoding-test blocks, which occurred immediately after one another. Each block consisted of 12 encoding and 72 test trials, all presented on a 13-in. laptop screen. Prior to encoding, participants were told that they would see many different stories and that they should pay close attention to all of the different elements, including the scene, person, and object in each story. Then, participants viewed a series of events (12 s each; 0.5 s intertrial interval). A short audio-recorded narrative accompanied each event (e.g., “Alice went to the aquarium, but she dropped her wallet there; the wallet was lost in the aquarium”; see Fig. [2]a). Each narrative consisted of three sentences, with each sentence highlighting one pairwise association within the event. The order of the pairwise associations within each narrative was not fixed or counterbalanced across the events. The narrative was constructed this way to engage children in the task and to increase the likelihood that children would pay attention to all of the elements in an event. Prior to encoding, we provided one example (a playground, Elastigirl, a hat) in order to acquaint the participants with the encoding task.’
Fig. 2: “Procedure of the child (a) and adult (b) multielement-event task. In the child task, participants viewed 24 events presented in two encoding sessions, each consisting of 12 events. Each event lasted 12 s and was accompanied by an audio-recorded narrative. The test phase of each block consisted of 72 test trials. In the adult task procedure, participants studied 24 events (6 s each) together and without the recorded narrative. The test phase consisted of 144 test trials. Note that the characters shown in each event were well-known cartoon characters (e.g., Alice, Pinocchio), which have been replaced in this illustration for copyright concerns.”
‘Immediately after the encoding phase of each block, participants performed a self-paced four-alternative forced-choice task. We tested participants on every possible cue–retrieval combination of each studied event, resulting in 6 test trials per event, which totaled 72 test trials per block. On each trial, a cue and four options were presented simultaneously on the screen (see Fig. [3]a). Among four options, one was a target—the correct item because it belonged to the same event as the cue. The three lures were same-category elements from different events. The lures always came from the events that contained same-sex characters, so that participants could not eliminate lures on the basis of general mnemonic heuristics (e.g., remembering that there was a female character who went to the aquarium). Across all 24 events, any two test trials that had overlapping cue items (e.g., AB and AC) or tested items (e.g., BA and CA) shared only one foil item (out of three) with respect to their event membership. For example, for the AB test trial of Event 1, the foils included the B elements from Events 2, 3, and 4, whereas for the AC trial of Event 1, the foils included the C elements from Events 3, 5, and 7 (one B and one C foil, both from Event 3). Furthermore, all items served as foils an equal number of times across all 144 test trials. Children were asked to point to one of the four options that belonged to the same story as the cue on the left side of the screen. Positions of the correct answer were counterbalanced across the entire test phase. There were no missing responses, as the response time was unrestricted. The memory task took approximately 40 min.’
Fig. 3: “A schematic depiction of the task design and the 2 × 2 contingency table used to estimate retrieval dependency. Examples of six retrieval types per event in the test phase are shown in (a). Each element of a studied event took a turn serving as the cue (item presented on the left side of the screen) and the tested element (one of the four options presented inside the red box). The schematic (b) shows how the proportion of joint retrieval for AB and AC pairs was computed for each participant. The contingency table shows the proportion of events that fell within each of the four categories: Both AB and AC pairs were retrieved correctly, both AB and AC pairs were retrieved incorrectly, AB was retrieved correctly and AC was retrieved incorrectly, and AB was retrieved incorrectly and AC was retrieved correctly. The proportion of events in the blue-outlined boxes (both pairs correct and both pairs incorrect) were added, and the sum was divided by the total number of events. Note that the characters shown in each event were well-known cartoon characters (e.g., Alice, Pinocchio), which have been replaced in this illustration for copyright concerns.”
‘The adult task procedure was similar to the child task procedure but with a few differences. First, the whole procedure was administered in a single session comprising 24 encoding events and 144 test trials. Second, no narratives were implemented at the encoding phase to avoid potential ceiling performance in young adults. Third, each encoding trial was presented for 6 s (see Fig. [2]b).’
Participants will be excluded if they perform at ceiling on the memory task (i.e., with 100% accuracy).
The key analysis of interest tests for retrieval dependency via a one-sample t-test, determining whether the dependency score (observed data – independent model) exceeds zero. This analysis mimics the key analysis of the original study (Ngo et al., 2019), described below in a direct quote from the original authors:
‘The retrieval dependency between retrieval successes for different associations within the same event was computed using the same methods as in previous studies (Bisby et al., 2018; Horner et al., 2015; Horner & Burgess, 2013, 2014). Six 2 × 2 contingency tables for the data and the predicted independent model were computed for each participant on the basis of their retrieval accuracy for each pairwise association in order to assess dependency between retrieving two elements when cued by the remaining common element within an event (ABAC; i.e., cue with A and retrieve B, and cue with A and retrieve C), and the dependency between retrieving a common item when cued by the other two elements within an event (BACA; i.e., cue with B and retrieve A, and cue with C and retrieve A). Each 2 × 2 contingency table for the data for every participant shows the proportion of events that fall within the four categories: Both AB and AC are correct or incorrect, AB is correct and AC is incorrect, and AC is correct and AB is incorrect. To examine retrieval dependency, we computed the proportion of joint retrieval for the data, defined as the proportion of events in which both associations were either correctly or incorrectly retrieved (Cells 1,1 and 2,2 of each contingency table; see Fig. [3]b). We then averaged this measure across six contingencies tables (three tables for the ABAC analysis for each element type and three tables for the BACA analysis for each element type) for each participant.’
‘The independent model of retrieval estimated the degree of statistical dependency if retrieval success for specific cue–test pairs (cue: person, test: scene) was independent of retrieval success of other cue–test pairs (cue: person, test: object) in relation to participants’ overall accuracy. The independent model predicted the proportion of joint retrieval given a participant’s overall level of performance if retrievals of event pairs were independent such that the probability of the successful retrieval for both, for example, AB and AC was equal to PAB × PAC, where PAB was the probability of retrieving B when cued by A across all events, and similarly for PAC (see Fig. [4] for full details). The proportion of joint retrieval for the independent model (calculated in the same manner as described above) served as a predicted baseline for which we compared the proportion of joint retrieval in the data. Given that the proportion of joint retrieval for the data scaled with accuracy, the main index of retrieval dependency was the difference between the proportion of joint retrieval in the data and independent model for each participant—referred to as dependency. If this dependency measure (data – independent model) was significantly greater than zero, this provided evidence for significant retrieval dependency (for the same approach, see Horner & Burgess, 2013, 2014). In addition, we took the magnitude of dependency to signify the extent of holistic retrieval.’
Fig. 4: “Contingency table for the predicted independent model for the proportion of correct and incorrect cued recognition over the total number of events for elements B and C when cued by A. PAB denotes the probability of retrieving B when cued by A. The proportion of joint retrieval for the independent model is calculated by summing the correct-correct and the incorrect-incorrect cells and dividing by the sum of all four cells.”
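To make the independent-model prediction concrete, here is a worked example with purely hypothetical accuracies, P(AB) = .8 and P(AC) = .7:

```r
# Hypothetical accuracies for one participant
p_ab <- 0.8  # P(retrieve B | cued with A)
p_ac <- 0.7  # P(retrieve C | cued with A)

# 2x2 contingency table predicted under independence (cells sum to 1)
ind_table <- matrix(c(p_ab * p_ac,       (1 - p_ab) * p_ac,
                      p_ab * (1 - p_ac), (1 - p_ab) * (1 - p_ac)),
                    nrow = 2,
                    dimnames = list(AB = c("correct", "incorrect"),
                                    AC = c("correct", "incorrect")))

# predicted joint retrieval: correct-correct + incorrect-incorrect cells
joint_ind <- ind_table["correct", "correct"] + ind_table["incorrect", "incorrect"]
joint_ind  # .8 * .7 + .2 * .3 = 0.62
```

If a participant’s observed proportion of joint retrieval exceeded this 0.62 baseline, the difference would be their dependency score.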
Additionally, similar to the original study, I will check for sex differences in overall retrieval accuracy. Beyond the analyses reported in the original study, I will use a linear modeling approach to investigate whether there are any age effects on retrieval accuracy and retrieval dependency. In addition to the typical frequentist statistics, Bayes factors will be reported for each analysis described above.
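For the Bayes factors, a package such as BayesFactor would be a natural choice; as a dependency-free sketch, the BIC approximation of Wagenmakers (2007) converts a one-sample t statistic into an approximate BF01 (evidence for the null). The helper below is illustrative, not part of the original analysis:

```r
# BIC approximation to the Bayes factor for a one-sample t-test
# (Wagenmakers, 2007): BF01 ~= sqrt(n) * (1 + t^2 / (n - 1))^(-n / 2)
bf01_from_t <- function(t, n) {
  sqrt(n) * (1 + t^2 / (n - 1))^(-n / 2)
}

# e.g., a small t with n = 12 yields BF01 > 1 (evidence favoring the null)
bf01 <- bf01_from_t(0.53, 12)
bf10 <- 1 / bf01  # evidence for the alternative
```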
## Load Relevant Libraries and Functions
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(report)
## Import data
### get all data files from directory
data_files <- list.files(file.path("data/full_run"), pattern = "\\.csv$", full.names = TRUE)
data_files
## [1] "data/full_run/1.csv" "data/full_run/10.csv" "data/full_run/11.csv"
## [4] "data/full_run/12.csv" "data/full_run/2.csv" "data/full_run/3.csv"
## [7] "data/full_run/4.csv" "data/full_run/5.csv" "data/full_run/6.csv"
## [10] "data/full_run/7.csv" "data/full_run/8.csv" "data/full_run/9.csv"
### read individual data files to list
all_data <- lapply(data_files, read.csv)
### unpack list as singular tidy data frame
all_data <- all_data |>
map_df(as_tibble)
The key differences between the original study and this formally preregistered replication are that only younger adults will be tested (not children) and that a much smaller sample (n = 12) of younger adults (not specifically undergraduate students) will be used, compared to the original n = 31 undergraduate students (18 female; M = 20.65 years, SD = 3.23, range 18-31). Participants here will be sampled from Prolific.co and paid $3.34 for their participation. This deviates from the original study, which sampled from the undergraduate student population at Temple University, where students participated for partial course credit. In both the original study and my replication, participants will have normal or corrected-to-normal vision. Additionally, I am imposing prescreening requirements: current country of residence in the United States, age between 18 and 30 years (inclusive), and English as a first language. I am also specifically targeting an even 50/50 split between male- and female-identifying participants via Prolific’s participant recruiting feature. Furthermore, unlike the original study, younger adult participants will not complete a verbal intelligence task after the primary memory task; they will complete only the encoding and retrieval portions of the memory task reported in the original published paper. I do not anticipate these differences affecting the expected results, given the large effect size (Cohen’s d = 1.32) reported in the original article (Ngo et al., 2019).
### unpack participant demographics
demographics_trials <- all_data |>
subset(trial_type == "survey-html-form")
### string together json strings
json_text <- paste0(demographics_trials$response, collapse = ',')
json_text <- paste0('[', json_text, ']')
### unpack json as data frame
demographics_trials <- jsonlite::fromJSON(json_text) |>
as_tibble() |>
mutate(subject_id = rownames(demographics_trials), .before = "Gender")
### get participant demographic stats
#### age
mean_age <- mean(as.numeric(demographics_trials$Age))
sd_age <- sd(as.numeric(demographics_trials$Age))
min_age <- min(as.numeric(demographics_trials$Age))
max_age <- max(as.numeric(demographics_trials$Age))
#### gender
table(demographics_trials$Gender)
##
## Female Male PreferNotToAnswer
## 6 5 1
The demographic composition of the participants I recruited online varied slightly from that of the original study: n = 12, M = 23.25 years, SD = 2.8, range 18-27; 6 female, 5 male, and 1 individual who preferred not to answer (compared to the original n = 31 undergraduate students; 18 female, M = 20.65 years, SD = 3.23, range 18-31; Ngo et al., 2019). Participants recruited via Prolific.co for this replication attempt met the following prescreening criteria: normal or corrected-to-normal vision, current residence in the United States, English as a first language, age 18-30 (inclusive), and no participation in Pilot A or B. I used the sex-matching feature on Prolific.co to aim for an equal sex distribution within the recruited online sample. Furthermore, no participants in this replication reported colorblindness, and no participants were excluded for having 100% accuracy on the four-alternative forced-choice cued recognition test.
The only unplanned addition was descriptive statistics and a plot for the proportion of joint retrieval for the data and independent models. These have been added below; the plot is included with the Retrieval Dependency and Overall Accuracy boxplots.
## Data exclusion / filtering
### clean data into long format
all_data_cleaned <- all_data |>
filter(!is.na(correct)) |> # pull out unnecessary rows for analysis
select(subject_id, trial_index, time_elapsed, rt, stimulus, task,
retrieval_group, response, correct_response, correct) |> # select relevant columns
mutate(accuracy = as.numeric(correct)) |> # convert logical correct to 0/1
separate(stimulus, c(NA, NA, NA, NA, "id"), sep = "_", remove = FALSE) # get each retrieval image id independent of the cue grouping
head(all_data_cleaned)
## # A tibble: 6 × 12
## subject_id trial_index time_elapsed rt stimulus id task retrieval_group
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 54 294132 2003 img/tes… 18.j… resp… Ca
## 2 1 55 296431 2286 img/tes… 8.jpg resp… Ca
## 3 1 56 300358 3924 img/tes… 12.j… resp… Ca
## 4 1 57 303378 3019 img/tes… 1.jpg resp… Ca
## 5 1 58 305916 2535 img/tes… 17.j… resp… Ca
## 6 1 59 309212 3294 img/tes… 19.j… resp… Ca
## # … with 4 more variables: response <chr>, correct_response <chr>,
## # correct <lgl>, accuracy <dbl>
## Prepare data for analysis - create columns etc.
### compute individual accuracy by grouping
accuracy_summary <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
group_by(retrieval_group) |>
summarise(mean = mean(accuracy), sd = sd(accuracy), n = n(), sem = sd/sqrt(n))
### compute accuracy by participant
accuracy_summary_part <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
group_by(subject_id) |>
summarise(mean = mean(accuracy), sd = sd(accuracy), n = n(), sem = sd/sqrt(n))
### compute overall accuracy
overall_accuracy_summary <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
summarise(mean = mean(accuracy), sd = sd(accuracy), n = n(), sem = sd/sqrt(n))
### compute Ab_Ac_Accuracy
Ab_Ac_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ab" | retrieval_group == "Ac") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute Ba_Bc_Accuracy
Ba_Bc_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ba" | retrieval_group == "Bc") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute Ca_Cb_Accuracy
Ca_Cb_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ca" | retrieval_group == "Cb") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute Ba_Ca_Accuracy
Ba_Ca_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ba" | retrieval_group == "Ca") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute Ac_Bc_Accuracy
Ac_Bc_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ac" | retrieval_group == "Bc") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute Ab_Cb_Accuracy
Ab_Cb_Accuracy <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
filter(retrieval_group == "Ab" | retrieval_group == "Cb") |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
### compute 6 2x2 contingency tables for retrieval dependency key analysis
#### define dependent model function
compute_data_model <- function(group_one, group_two, num_events = 24) {
test_set <- all_data_cleaned |>
filter(task == "response") |> # only use retrieval data
select(subject_id, id, retrieval_group, accuracy) |> # select only the necessary columns
filter(retrieval_group == group_one | retrieval_group == group_two) |> # filter out relevant groups from function params
pivot_wider(id_cols = c(subject_id, id), names_from = retrieval_group, values_from = accuracy) |> # make long data frame wide for data dependent calculations
mutate(sum = .data[[group_one]] + .data[[group_two]]) # number correct per event (0, 1, or 2) for the contingency table
# get unique participants
ps <- unique(all_data_cleaned$subject_id)
data_models <- rep(NA, length(ps))
for(ii in 1:length(ps)) {
# get data for just the one participant in the loop
data_subset <- test_set |>
filter(subject_id == ps[ii])
# get proportion of all correct
prop_all_correct <- sum(data_subset$sum == 2) / num_events
# get proportion of all incorrect
prop_all_incorrect <- sum(data_subset$sum == 0) / num_events
# compute data model and store per participant
data_model_calc <- prop_all_correct + prop_all_incorrect
data_models[ii] <- data_model_calc
}
return(data_models)
}
#### 1) compute Data_Ab_Ac
Data_Ab_Ac <- compute_data_model("Ab", "Ac")
#### 2) compute Data_Ba_Bc
Data_Ba_Bc <- compute_data_model("Ba", "Bc")
#### 3) compute Data_Ca_Cb
Data_Ca_Cb <- compute_data_model("Ca", "Cb")
#### 4) compute Data_Ba_Ca
Data_Ba_Ca <- compute_data_model("Ba", "Ca")
#### 5) compute Data_Ac_Bc
Data_Ac_Bc <- compute_data_model("Ac", "Bc")
#### 6) compute Data_Ab_Cb
Data_Ab_Cb <- compute_data_model("Ab", "Cb")
#### define independent model function
compute_ind_model <- function(group_one, group_two) {
P_AB <- all_data_cleaned |>
filter(retrieval_group == group_one) |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
P_AC <- all_data_cleaned |>
filter(retrieval_group == group_two) |>
group_by(subject_id) |>
summarise(mean = mean(accuracy))
## predicted proportion of joint retrieval under independence:
## correct-correct plus incorrect-incorrect cells of the 2x2 table
cor_cor <- P_AB$mean * P_AC$mean
incor_incor <- (1 - P_AB$mean) * (1 - P_AC$mean)
return(cor_cor + incor_incor)
}
#### 1) compute Independent_Model_Ab_Ac
Independent_Model_Ab_Ac <- compute_ind_model("Ab", "Ac")
#### 2) compute Independent_Model_Ba_Bc
Independent_Model_Ba_Bc <- compute_ind_model("Ba", "Bc")
#### 3) compute Independent_Model_Ca_Cb
Independent_Model_Ca_Cb <- compute_ind_model("Ca", "Cb")
#### 4) compute Independent_Model_Ba_Ca
Independent_Model_Ba_Ca <- compute_ind_model("Ba", "Ca")
#### 5) compute Independent_Model_Ac_Bc
Independent_Model_Ac_Bc <- compute_ind_model("Ac", "Bc")
#### 6) compute Independent_Model_Ab_Cb
Independent_Model_Ab_Cb <- compute_ind_model("Ab", "Cb")
#### 1) compute Dependency_AbAc
Dependency_AbAc <- Data_Ab_Ac - Independent_Model_Ab_Ac
#### 2) compute Dependency_BaBc
Dependency_BaBc <- Data_Ba_Bc - Independent_Model_Ba_Bc
#### 3) compute Dependency_CaCb
Dependency_CaCb <- Data_Ca_Cb - Independent_Model_Ca_Cb
#### 4) compute Dependency_BaCa
Dependency_BaCa <- Data_Ba_Ca - Independent_Model_Ba_Ca
#### 5) compute Dependency_AcBc
Dependency_AcBc <- Data_Ac_Bc - Independent_Model_Ac_Bc
#### 6) compute Dependency_AbCb
Dependency_AbCb <- Data_Ab_Cb - Independent_Model_Ab_Cb
### compute Collapsed_Data
total_data_model <- (Data_Ab_Ac + Data_Ba_Bc + Data_Ca_Cb + Data_Ba_Ca + Data_Ac_Bc + Data_Ab_Cb) / 6
### compute Collapsed_Ind_Model
total_ind_model <- (Independent_Model_Ab_Ac + Independent_Model_Ba_Bc + Independent_Model_Ca_Cb + Independent_Model_Ba_Ca + Independent_Model_Ac_Bc + Independent_Model_Ab_Cb) / 6
### compute Dependency
Dependency <- total_data_model - total_ind_model
Dependency
## [1] -0.04050926 0.06539352 0.07175926 -0.03298611 0.03240741 0.06655093
## [7] 0.19791667 -0.33506944 -0.36574074 0.01967593 0.44097222 0.30034722
Dependency_avg <- mean(Dependency)
Dependency_avg
## [1] 0.0350598
## construct complete data frame with all relevant stats summarized for each participant
summary_data <- data.frame(subject_id = accuracy_summary_part$subject_id,
Accuracy_mean = accuracy_summary_part$mean,
Accuracy_sd = accuracy_summary_part$sd,
Accuracy_sem = accuracy_summary_part$sem,
Accuracy_n = accuracy_summary_part$n,
Accuracy_AbAc = Ab_Ac_Accuracy$mean,
Accuracy_BaBc = Ba_Bc_Accuracy$mean,
Accuracy_CaCb = Ca_Cb_Accuracy$mean,
Accuracy_BaCa = Ba_Ca_Accuracy$mean,
Accuracy_AcBc = Ac_Bc_Accuracy$mean,
Accuracy_AbCb = Ab_Cb_Accuracy$mean,
Data_AbAc = Data_Ab_Ac,
Data_BaBc = Data_Ba_Bc,
Data_CaCb = Data_Ca_Cb,
Data_BaCa = Data_Ba_Ca,
Data_AcBc = Data_Ac_Bc,
Data_AbCb = Data_Ab_Cb,
Independent_AbAc = Independent_Model_Ab_Ac,
Independent_BaBc = Independent_Model_Ba_Bc,
Independent_CaCb = Independent_Model_Ca_Cb,
Independent_BaCa = Independent_Model_Ba_Ca,
Independent_AcBc = Independent_Model_Ac_Bc,
Independent_AbCb = Independent_Model_Ab_Cb,
Dependency_AbAc = Dependency_AbAc,
Dependency_BaBc = Dependency_BaBc,
Dependency_CaCb = Dependency_CaCb,
Dependency_BaCa = Dependency_BaCa,
Dependency_AcBc = Dependency_AcBc,
Dependency_AbCb = Dependency_AbCb,
Data_model_overall = total_data_model,
Independent_model_overall = total_ind_model,
Dependency_overall = Dependency)
DT::datatable(summary_data)
## count number of participants to exclude for having overall task accuracy == 100%
perfect_acc_exclusions_idx <- which(summary_data$Accuracy_mean == 1)
if(length(perfect_acc_exclusions_idx) > 0) {
summary_data_w_exclusions <- summary_data[-perfect_acc_exclusions_idx,] # drop participants at ceiling
} else {
print("No exclusions based on accuracy made!")
}
## [1] "No exclusions based on accuracy made!"
num_acc_exclusions <- length(perfect_acc_exclusions_idx)
num_acc_exclusions
## [1] 0
### key confirmatory analysis: test whether dependency exceeds zero (full run, n = 12)
key_test <- t.test(Dependency, alternative = "greater", mu = 0)
I conducted a one-sample t-test to determine whether dependency (data – independent model) exceeded zero for young adults. Retrieval dependency scores were not significantly greater than the test value of zero, t(11) = 0.53, p = 0.303, Cohen’s d = 0.15, 95% confidence interval (CI) = [-0.34, Inf]; M = 0.04, SD = 0.23, SEM = 0.07, 95% CI = [-0.09, 0.16]. Original data from Ngo et al. (2019): M = .07, SD = .05, SEM = .01, 95% CI = [.05, .08].
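As a sanity check, the reported test statistics can be recomputed directly from the per-participant Dependency values printed above:

```r
# Recompute the one-sample test statistics from the per-participant
# dependency scores printed above
dep <- c(-0.04050926, 0.06539352, 0.07175926, -0.03298611, 0.03240741,
         0.06655093, 0.19791667, -0.33506944, -0.36574074, 0.01967593,
         0.44097222, 0.30034722)
n <- length(dep)
t_stat <- mean(dep) / (sd(dep) / sqrt(n))                  # ~0.53
d <- mean(dep) / sd(dep)                                   # Cohen's d, ~0.15
p_one_sided <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # ~0.30
```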
## compute 95% CI of acc mean assuming normal distribution
acc_mean <- mean(summary_data$Accuracy_mean)
acc_sd <- sd(summary_data$Accuracy_mean)
n <- nrow(summary_data) # number of participants (12)
error <- qnorm(0.975)*acc_sd/sqrt(n)
lower_ci <- acc_mean - error
upper_ci <- acc_mean + error
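Note that the interval above uses the normal quantile qnorm(0.975); with n = 12, a t-based interval (df = n - 1) is slightly wider. The helper below is an alternative shown for comparison, not the computation used for the reported CIs:

```r
# 95% CI using the t distribution (more appropriate for small n than
# the normal approximation)
ci_t <- function(x, level = 0.95) {
  n <- length(x)
  err <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - err, upper = mean(x) + err)
}
```

With 11 degrees of freedom, qt(0.975, 11) is about 2.20 versus qnorm(0.975) at about 1.96, so the t-based interval is roughly 12% wider.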
‘Overall accuracy was defined as the proportion of target selection across 144 test trials’ (Ngo et al., 2019). Overall accuracy was 0.56 (SD = 0.22, SEM = 0.06, 95% CI = [0.43, 0.68]). Original data from Ngo et al. (2019): M = .72, SD = .19, SEM = .03.
The proportion of joint retrieval for the data was 0.62 (SD = 0.14, SEM = 0.04, 95% CI = [0.54, 0.71]). Original data from Ngo et al. (2019) for the retrieval data: M = .72, SD = .13, SEM = .02, 95% CI = [.67, .76].
The proportion of joint retrieval for the independent model was 0.59 (SD = 0.16, SEM = 0.05, 95% CI = [0.5, 0.68]). Original data from Ngo et al. (2019) for the independent model: M = .65, SD = .14, SEM = .02, 95% CI = [.60, .70].
Fig. 5a: Original key data figures from Ngo et al. (2019). The current replication attempt was only interested in the young adult population (i.e., the elements of the figures highlighted by red bounding boxes).
### Overall Accuracy
#### make plot
acc_plot <- ggplot(summary_data, aes(x = "", y = Accuracy_mean)) +
  geom_boxplot(width = 0.15, fill = "#4f50a2", outlier.shape = NA) +
  geom_jitter(width = 0, col = "#363986", alpha = .7, shape = 21, size = 1, stroke = 1) +
  xlab("Young Adults") +
  ylab("Overall Accuracy") +
  scale_y_continuous(breaks = seq(0.25, 1.00, 0.25), limits = c(0.25, 1.00)) +
  theme_classic() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(color = "grey20", size = 12),
        axis.text.x = element_text(color = "grey20", size = 16),
        axis.text.y = element_text(color = "grey20", size = 12),
        axis.title.x = element_text(color = "grey20", size = 12, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 12, vjust = 0.5, face = "bold"))
### Proportion of Joint Retrieval
#### get relevant data in long format
summary_data_joint_ret_long <- pivot_longer(summary_data,
                                            c(Data_model_overall, Independent_model_overall),
                                            names_to = "source",
                                            values_to = "prop_joint_retrieval") |>
  rowwise() |>
  mutate(source_cleaned = ifelse(source == "Data_model_overall", "Data", "Independent Model")) |>
  select(subject_id, source, source_cleaned, prop_joint_retrieval)
#### make plot
ret_plot <- ggplot(summary_data_joint_ret_long, aes(x = source_cleaned, y = prop_joint_retrieval, fill = source_cleaned)) +
  geom_boxplot(width = 0.15, outlier.shape = NA, position = position_dodge(1)) +
  geom_jitter(aes(col = source_cleaned, x = source_cleaned), position = position_jitter(width = 0), alpha = .7, shape = 21, size = 1, stroke = 1) +
  xlab("Young Adults") +
  ylab("Proportion of Joint\nRetrieval") +
  scale_y_continuous(breaks = seq(0.40, 1.00, 0.1), limits = c(0.40, 1.00)) +
  scale_fill_manual(values = c("#ad93ad", "#dbb5d6")) +
  scale_color_manual(values = c("#5c475b", "#573350")) +
  theme_classic() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(color = "grey20", size = 10),
        legend.position = "none",
        axis.text.x = element_text(color = "grey20", size = 12),
        axis.text.y = element_text(color = "grey20", size = 10),
        axis.title.x = element_text(color = "grey20", size = 12, vjust = -0.5, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 12, vjust = 0.5, face = "bold"))
### Dependency
#### make plot
dep_plot <- ggplot(summary_data, aes(x = "", y = Dependency_overall)) +
  geom_boxplot(width = 0.15, fill = "#4f50a2", outlier.shape = NA) +
  geom_jitter(width = 0, col = "#363986", alpha = .7, shape = 21, size = 1, stroke = 1) +
  geom_hline(yintercept = 0.0, col = "red", lty = "dashed") +
  xlab("Young Adults") +
  ylab("Retrieval Dependency") +
  scale_y_continuous(breaks = seq(-0.50, 0.50, 0.1), limits = c(-0.50, 0.50)) +
  theme_classic() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(color = "grey20", size = 12),
        axis.text.x = element_text(color = "grey20", size = 16),
        axis.text.y = element_text(color = "grey20", size = 12),
        axis.title.x = element_text(color = "grey20", size = 12, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 12, vjust = 0.5, face = "bold"))
### Combine Plots for Rendered Report
ggpubr::ggarrange(acc_plot, ret_plot, dep_plot, ncol = 2, nrow = 2, widths = c(1,1), heights = c(2,2))
Fig. 5b: Attempted replication of key data from the current study. The dashed red line on the “Retrieval Dependency” panel marks the test value of 0 used in the t-test; scores above the line indicate positive retrieval dependency.
### pair age data with dependency and accuracy data
exploratory_analysis <- merge(summary_data, demographics_trials, by = "subject_id")
### run linear model of age (years) predicting summary dependency scores (n = 12 measures)
dep_lm <- lm(Dependency_overall ~ as.numeric(Age), data = exploratory_analysis)
summary(dep_lm) ## not significant
##
## Call:
## lm(formula = Dependency_overall ~ as.numeric(Age), data = exploratory_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43083 -0.07790 0.01985 0.12440 0.37588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59365 0.57652 1.030 0.327
## as.numeric(Age) -0.02403 0.02463 -0.975 0.352
##
## Residual standard error: 0.2288 on 10 degrees of freedom
## Multiple R-squared: 0.08686, Adjusted R-squared: -0.00445
## F-statistic: 0.9513 on 1 and 10 DF, p-value: 0.3524
### run linear model of age (years) predicting summary accuracy scores (n = 12 measures)
acc_lm <- lm(Accuracy_mean ~ as.numeric(Age), data = exploratory_analysis)
summary(acc_lm) ## not significant
##
## Call:
## lm(formula = Accuracy_mean ~ as.numeric(Age), data = exploratory_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27236 -0.12399 -0.06129 0.10227 0.39704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.461795 0.584820 0.790 0.448
## as.numeric(Age) 0.004207 0.024988 0.168 0.870
##
## Residual standard error: 0.2321 on 10 degrees of freedom
## Multiple R-squared: 0.002826, Adjusted R-squared: -0.09689
## F-statistic: 0.02834 on 1 and 10 DF, p-value: 0.8697
### get all individual dependency scores by block (n = 12 * 6 blocks = 72 measures)
all_dep_indv_scores <- exploratory_analysis |>
  select(subject_id, Dependency_AbAc:Dependency_AbCb, Age) |>
  pivot_longer(starts_with("Dependency_"), names_to = "Block", values_to = "Score")
### run linear model of age (years) predicting all individual dependency scores by block (n = 72 measures)
dep_indv_lm <- lm(Score ~ as.numeric(Age), data = all_dep_indv_scores)
summary(dep_indv_lm) ## significant
##
## Call:
## lm(formula = Score ~ as.numeric(Age), data = all_dep_indv_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56856 -0.11953 0.00309 0.15481 0.45574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59365 0.23967 2.477 0.0157 *
## as.numeric(Age) -0.02403 0.01024 -2.346 0.0218 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.233 on 70 degrees of freedom
## Multiple R-squared: 0.0729, Adjusted R-squared: 0.05966
## F-statistic: 5.504 on 1 and 70 DF, p-value: 0.02181
### make plot
ggplot(all_dep_indv_scores, aes(x = as.numeric(Age), y = Score)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  xlab("Chronological Age (years)") +
  ylab("Retrieval Dependency") +
  scale_x_continuous(breaks = seq(18, 27, 1), limits = c(18, 27)) +
  scale_y_continuous(breaks = seq(-0.50, 0.50, 0.1), limits = c(-0.50, 0.50)) +
  labs(title = paste("R^2 = ", signif(summary(dep_indv_lm)$r.squared, 1),
                     ", Intercept =", signif(dep_indv_lm$coef[[1]], 1),
                     ", Slope =", signif(dep_indv_lm$coef[[2]], 1),
                     ", p =", signif(summary(dep_indv_lm)$coef[2,4], 2))) +
  theme_classic() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(color = "grey20", size = 12),
        axis.text.x = element_text(color = "grey20", size = 16),
        axis.text.y = element_text(color = "grey20", size = 12),
        axis.title.x = element_text(color = "grey20", size = 12, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 12, vjust = 0.5, face = "bold"))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
### get all individual accuracy scores by block (n = 12 * 6 blocks = 72 measures)
all_acc_indv_scores <- exploratory_analysis |>
  select(subject_id, Accuracy_AbAc:Accuracy_AbCb, Age) |>
  pivot_longer(starts_with("Accuracy_"), names_to = "Block", values_to = "Score")
### run linear model of age (years) predicting all individual accuracy scores by block (n = 72 measures)
acc_indv_lm <- lm(Score ~ as.numeric(Age), data = all_acc_indv_scores)
summary(acc_indv_lm) ## not significant
##
## Call:
## lm(formula = Score ~ as.numeric(Age), data = all_acc_indv_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35443 -0.15872 -0.04593 0.16130 0.42882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.461795 0.233581 1.977 0.052 .
## as.numeric(Age) 0.004207 0.009980 0.422 0.675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.227 on 70 degrees of freedom
## Multiple R-squared: 0.002532, Adjusted R-squared: -0.01172
## F-statistic: 0.1777 on 1 and 70 DF, p-value: 0.6747
Young adult participants’ retrieval dependency scores did not differ significantly from 0, thus failing to replicate the original effect (Figure 5b). Positive retrieval dependency scores (i.e., greater than 0) indicate that holistic event retrieval via hippocampal pattern completion is mutually contingent across the elements encoded in an episodic memory trace (see Figure 5a). The absence of a significant retrieval dependency effect indicates that the accuracy of episodic memory recollection was not mutually contingent across within-event test trials.
Further inspection of the data indicates that 8 of 12 participants in the current replication attempt had retrieval dependency scores greater than 0. This is somewhat encouraging, in that only one third of participants showed negative retrieval dependency (i.e., mutually contingent forgetting). In the original study, 31 young adult undergraduate students completed the task in a controlled, in-person laboratory environment (Ngo et al., 2019), and 29 of those 31 participants had retrieval dependency scores greater than 0. Considering the large effect size (Cohen’s d = 1.32) found in the original study (Ngo et al., 2019), and that my a priori power analysis suggested that 11 participants were needed to achieve 80% power for an effect size of Cohen’s d = 0.80, I was surprised not to find this strong effect in my attempted replication. Although none of the participants in my online replication attempt via Prolific.co reported cheating on the task or experiencing technical difficulties, it is possible that completing the task in an uncontrolled online environment pushed the results away from the expected strong retrieval dependency effect.
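For transparency, the a priori sample size can be recomputed with the `pwr` package (an assumption on my part: the exact tool used for the original power analysis is not shown here), matching the one-sided, one-sample confirmatory test at α = .05:

```r
library(pwr)  # assumes the pwr package is installed

# n needed for 80% power to detect d = 0.80 (one-sample, one-sided test)
pwr.t.test(d = 0.80, power = 0.80, sig.level = 0.05,
           type = "one.sample", alternative = "greater")

# For comparison: n implied by the original study's effect size (d = 1.32)
pwr.t.test(d = 1.32, power = 0.80, sig.level = 0.05,
           type = "one.sample", alternative = "greater")
```

Note that powering for d = 0.80 rather than the observed d = 1.32 was the conservative choice; the first call yields the ~11 participants cited above.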
Hippocampal pattern completion is well supported in the literature as the neural mechanism for holistic episodic memory retrieval (Marr, 1971; McClelland et al., 1995; Norman & O’Reilly, 2003). It is therefore unlikely that this failed replication attempt signifies problems with contemporary theories of episodic memory recollection. Rather, the paradigm, experimental parameters, and metrics used by Ngo et al. (2019) likely reflect an upper bound on performance in a controlled experimental context that does not generalize well to small sample sizes and less controlled environments. It is also plausible that the very large effect size reported in the original study reflects an inflated estimate of the strength of this effect, given the failed replication with a sample size that was sufficient according to a priori power analyses. Future research should assess the validity of this metric and the strength of effect it can detect at small sample sizes and in more generalizable episodic memory scenarios. Additionally, intentional comparisons of episodic memory recollection in controlled versus more distracting environments could clarify how retrieval dependency is dampened when individuals experience attentional lapses or distractions (Madore et al., 2020).
Overall, the failure to replicate this strong effect was unexpected but carries important implications for my future work studying episodic memory retrieval. Specifically, I am skeptical of the sensitivity and validity of this particular metric for quantifying retrieval dependency, given the strong evidence supporting hippocampal pattern completion in mammalian neural substrates (Marr, 1971; McClelland et al., 1995; Norman & O’Reilly, 2003). I am further interested in uncovering what sample sizes and environmental conditions are required to obtain this effect in less controlled online settings. Given that the method I implemented was nearly identical to the original study’s, I find it most plausible that the key difference between the current replication attempt and the original study was the small, online-recruited sample, which moderated the anticipated effect. The lead author of the original study was supportive of and excited by my replication attempt, and I believe that further discussion and dissection of my findings could inform future use of these materials and methods in studies of episodic memory encoding and retrieval.
The exploratory analyses are interesting in that retrieval dependency scores significantly decreased with chronological age (in years). This makes sense in light of evidence suggesting that episodic memory ability declines with increasing age in older adults (Trelle et al., 2020). The significant effect should be interpreted with caution, however, given the narrow sampling distribution across ages and the relatively few data points entered into the regression model. Furthermore, retrieval dependency data aggregated across all six retrieval blocks (i.e., 72 data points) were used rather than the overall summary retrieval dependency data (i.e., 12 data points) to increase power in the regression model; the linear model using only the 12 summary data points was not significant. Larger-scale replications of this study should therefore ensure strong representation of each age group to interpret age effects on retrieval dependency more confidently.
Additionally, the exploratory analysis of retrieval accuracy scores predicted by chronological age (in years) was not significant for either the summary (12 measures) or the aggregated block-level (72 measures) data. Caution in interpretation aside, this might suggest that overall memory capacity is relatively stable with increasing age in young adults. Taken together with the decline in retrieval dependency strength above, this could suggest that the ability to separate overlapping neural patterns during cortical reinstatement of episodic memory traces decreases as we get older, while the ability to remember information independent of the context in which it was originally encoded is less affected. Crucially, these interpretations rest on a very small sample size and number of measures, as well as disparately sampled age groups, and must be treated with caution.
One last concern I have with this study relates to the stimulus set. Figure 6a shows an example of an encoding event with high semantic relatedness (i.e., a holiday scene with snow, a movie character with a scarf, and a holiday-themed wrapped gift). This contrasts with the type of encoding event shown in Figure 6b, which has low semantic relatedness. The stimulus set was originally designed for developmental work with young children; despite these intentions, its high variability seems problematic, because highly related encoding events may be easier to bind into a story and may benefit from other mnemonic strategies (such as relying on prior semantic associations) that inherently boost memory on certain trials and thus do not accurately reflect episodic associations on the current task. In both the original study and my direct replication attempt, the stimuli were used exactly as provided: static images in which all episodic encoding event triplets were pre-made and thus not randomly assigned. The only randomization at encoding was the order in which the 24 encoding events were presented. A future study should (1) randomize the contextual episodes and (2) use more complex stimuli when assessing pattern completion for episodic memory retrieval in adult participants.
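As an illustration of point (1), element-to-event assignment could be randomized per session rather than fixed. A hypothetical sketch (the element pools and naming below are assumptions for illustration, not the original stimulus design):

```r
set.seed(1)  # reproducible assignment for a given session
n_events <- 24

# Hypothetical element pools; the original set pairs these into fixed triplets
scenes  <- paste0("scene_",  1:n_events)
people  <- paste0("person_", 1:n_events)
objects <- paste0("object_", 1:n_events)

# Randomly re-pair elements into triplets, breaking fixed semantic pairings
events <- data.frame(scene  = sample(scenes),
                     person = sample(people),
                     object = sample(objects))
head(events, 3)
```

Randomizing triplets across participants would let semantic-relatedness effects average out rather than being confounded with particular events.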
Fig. 6a: Example of an encoding event with high semantic relatedness from the original stimulus set.
Fig. 6b: Example of an encoding event with low semantic relatedness from the original stimulus set.